A First Empirical Study of Emphatic Temporal Difference Learning
Authors
Abstract
In this paper we present the first empirical study of the emphatic temporal-difference learning algorithm (ETD), comparing it with conventional temporal-difference learning, in particular with linear TD(0), on on-policy and off-policy variations of the Mountain Car problem. The initial motivation for developing ETD was that it has good convergence properties under off-policy training (Sutton, Mahmood & White 2016), but it is also a new algorithm for the on-policy case. In both our on-policy and off-policy experiments, we found that each method converged to a characteristic asymptotic level of error, with ETD better than TD(0). TD(0) achieved a still lower error level temporarily before falling back to its higher asymptote, whereas ETD never showed this kind of "bounce". In the off-policy case (in which TD(0) is not guaranteed to converge), ETD was significantly slower.

1 Emphatic Temporal Difference Learning

We consider the problem of learning the value function of a Markov decision process under a given policy. An agent and environment interact at discrete time steps, $t = 0, 1, 2, \ldots$, at each of which the environment is in a state $S_t$, the agent selects an action $A_t$, and as a result the environment emits a reward $R_{t+1}$ and a next state $S_{t+1}$. States are represented to the agent as feature vectors $\phi_t = \phi(S_t) \in \mathbb{R}^n$. We seek a parameter vector $\theta_t \in \mathbb{R}^n$ such that the inner product $\theta_t^\top \phi_t$ approximates the expected return
$$\mathbb{E}\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid A_{t:\infty} \sim \pi \right],$$
where $\pi : \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the policy used to select the future actions. In general, however, the actions are selected by a possibly different behavior policy $\mu$. If $\pi = \mu$, the training is called on-policy; if the two policies differ, it is called off-policy.

We consider the special case of the emphatic temporal-difference learning algorithm (ETD) in which bootstrapping is complete ($\lambda(s) = 0, \forall s$) and there is no discounting ($\gamma(s) = 1, \forall s$). Studying TD and ETD methods with complete bootstrapping is suitable because in this case the differences between them are maximized. As $\lambda$ approaches 1, the methods behave more similarly, up to the point where they become equivalent at $\lambda = 1$. With $\lambda = 0$ and $\gamma = 1$, the ETD algorithm is completely described by:
$$\theta_{t+1} \doteq \theta_t + \alpha \rho_t F_t \left( R_{t+1} + \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t \right) \phi_t,$$
$$F_t \doteq \rho_{t-1} F_{t-1} + 1, \quad \text{with } F_0 \doteq 1,$$
$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)},$$
where $\alpha > 0$ is a step-size parameter. $F_t$ is the followon trace, according to which the update at each time step is emphasized or de-emphasized. TD is obtained by removing $F_t$ from the first equation. Because of $F_t$, ETD differs from TD even in the on-policy case, in which $\rho_t$ is always 1. For a thorough explanation of ETD see Sutton, Mahmood & White (2016).
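To make the update concrete, here is a minimal Python sketch of the ETD(0) and TD(0) updates above (λ = 0, γ = 1). The function and argument names (`etd0_update`, `td0_update`, `F_prev`, `rho_prev`, etc.) are our own, and the code is an illustrative sketch under these assumptions, not the authors' implementation.

```python
import numpy as np

def etd0_update(theta, F_prev, phi, phi_next, reward, rho, rho_prev, alpha):
    """One ETD(0) step with linear function approximation (lambda = 0, gamma = 1).

    theta    : parameter vector, shape (n,)
    F_prev   : followon trace from the previous step (initialize to 0.0 so that F_0 = 1)
    phi      : feature vector of the current state S_t
    phi_next : feature vector of the next state S_{t+1}
    rho      : importance-sampling ratio pi(A_t|S_t) / mu(A_t|S_t)
    rho_prev : importance-sampling ratio from the previous step (0.0 on the first step)
    """
    F = rho_prev * F_prev + 1.0                         # F_t = rho_{t-1} F_{t-1} + 1
    td_error = reward + theta @ phi_next - theta @ phi  # one-step TD error (gamma = 1)
    theta = theta + alpha * rho * F * td_error * phi    # emphasized, importance-weighted update
    return theta, F

def td0_update(theta, phi, phi_next, reward, rho, alpha):
    """Conventional linear TD(0): the same update with the emphasis F_t removed."""
    td_error = reward + theta @ phi_next - theta @ phi
    return theta + alpha * rho * td_error * phi
```

In the on-policy setting $\rho_t \equiv 1$, so the only difference between the two updates is the multiplicative emphasis $F_t$.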
2 Stability of On-policy TD with Variable λ: A Counterexample

In this section we show that although the initial motivation for developing ETD was its good convergence properties under off-policy training (Yu 2015), it is also a different algorithm under on-policy training. To emphasize the difference between the two, we present a simple example for which TD(λ) is not convergent under on-policy training but ETD is. It has long been known that TD(λ) converges with any constant value of λ under on-policy training (Tsitsiklis & Van Roy 1997). Surprisingly, TD(λ) is not assured to converge with varying λ even under on-policy training.

Yu has recently presented a counterexample (personal communication) with state-dependent λ for which on-policy TD(λ) is not convergent. The example is a simple Markov decision process consisting of two states, in which the system simply moves from one state to the other in a cycle. The process starts in each of the states with equal probability. Let $\lambda(S_1) = 0$, $\lambda(S_2) = 1$, $\phi(S_1) = (3, 1)$, $\phi(S_2) = (1, 1)$, and $\gamma = 0.95$. As shown below, the TD(λ) key matrix for this problem is not positive definite. Moreover, both eigenvalues of the key matrix have negative real parts, and thus TD(λ) diverges in this case:
$$\text{Key matrix} = \begin{pmatrix} -0.4862 & 0.1713 \\ -0.7787 & 0.0738 \end{pmatrix}.$$
ETD, in contrast, is convergent under both on-policy and off-policy training with variable λ. This example appears in more detail in the supplementary material.

3 Fixed-policy Mountain Car Testbed

For our experimental study, we used a new variation of the mountain car control problem (Sutton & Barto 1998) to form a prediction problem. The original mountain car problem has a two-dimensional state space, position (between −1.2 and 0.6) and velocity (between −0.07 and 0.07), with three actions: full throttle forward, full throttle backward, and zero throttle. Each episode starts near the bottom of the hill (at a position drawn uniformly at random between −0.6 and −0.4). The reward is −1 on all time steps until the car passes its goal at the top of the hill, which ends the episode. The task is undiscounted. Our variation of the mountain car problem has a fixed target policy, which is to always push in the direction of the velocity and not to push in either direction when the velocity is 0. We call this new variation the fixed-policy mountain car testbed. The performance measure we used is an estimate of the mean squared value error (MSVE), the mean squared difference between the true value function and the estimated value function, weighted by how often each state is visited under the behavior policy:
$$\mathrm{MSVE}(\theta) = \sum_{s} d_\mu(s)\,\bigl( v_\pi(s) - \theta^\top \phi(s) \bigr)^2,$$
where $d_\mu$ is the state-visitation distribution under the behavior policy and $v_\pi$ is the true value function of the target policy.
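As a rough illustration of the fixed target policy and the MSVE estimate described in this section, here is a small Python sketch. The names (`target_policy_action`, `estimate_msve`, `feature_fn`) and the idea of plugging in Monte Carlo estimates of the true values are our own assumptions, not details taken from the paper.

```python
import numpy as np

# Mountain car actions: 0 = full throttle backward, 1 = zero throttle, 2 = full throttle forward.
def target_policy_action(velocity):
    """Fixed target policy: push in the direction of the velocity; do nothing when velocity is 0."""
    if velocity > 0:
        return 2
    if velocity < 0:
        return 0
    return 1

def estimate_msve(theta, states, true_values, feature_fn):
    """Estimate MSVE from states sampled while following the behavior policy.

    Sampling the states from behavior-policy trajectories weights each squared
    error by how often that state is visited, matching the definition above.
    true_values[i] is an estimate of v_pi(states[i]), e.g. obtained from
    Monte Carlo rollouts of the target policy (an assumption of this sketch).
    """
    sq_errors = [(true_values[i] - theta @ feature_fn(s)) ** 2
                 for i, s in enumerate(states)]
    return float(np.mean(sq_errors))
```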
Similar Resources
Emphatic Temporal-Difference Learning
Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps. Recent works by Sutton, Mahmood and White (2015), and Yu (2015) show that by varying the emphasis in a particular way, these algorithms become stable and convergent under off-policy training with linea...
On Convergence of Emphatic Temporal-Difference Learning
We consider emphatic temporal-difference learning algorithms for policy evaluation in discounted Markov decision processes with finite spaces. Such algorithms were recently proposed by Sutton, Mahmood, and White (2015) as an improved solution to the problem of divergence of off-policy temporal-difference learning with linear function approximation. We present in this paper the first convergence...
Some Simulation Results for Emphatic Temporal-Difference Learning Algorithms
This is a companion note to our recent study of the weak convergence properties of constrained emphatic temporal-difference learning (ETD) algorithms from a theoretic perspective. It supplements the latter analysis with simulation results and illustrates the behavior of some of the ETD algorithms using three example problems.
Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis
We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White, 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms as special cases. We call this framework ETD(λ, β), where o...
O2TD: (Near)-Optimal Off-Policy TD Learning
Temporal difference learning and Residual Gradient methods are the most widely used temporal difference based learning algorithms; however, it has been shown that none of their objective functions are optimal w.r.t approximating the true value function V . Two novel algorithms are proposed to approximate the true value function V . This paper makes the following contributions: • A batch algorit...
True Online Emphatic TD(λ): Quick Reference and Implementation Guide
TD(λ) is the core temporal-difference algorithm for learning general state-value functions (Sutton 1988, Singh & Sutton 1996). True online TD(λ) is an improved version incorporating dutch traces (van Seijen & Sutton 2014, van Seijen, Mahmood, Pilarski & Sutton 2015). Emphatic TD(λ) is another variant that includes an “emphasis algorithm” that makes it sound for off-policy learning (Sutton, Mahm...
Journal: CoRR
Volume: abs/1705.04185
Publication date: 2017